Factorization of Language Models through Backing-Off Lattices 



Wei Wang 
Department of Computer Science 
New York University 
{wei}@cs.nyu.edu



Abstract 

Factorization of statistical language models is the task of
resolving the most discriminative model into factored models and
combining them into a new model that provides a better estimate.
Most previous work focuses on factorizing models of sequential
events, each of which allows only one factorization manner. To
enable parallel factorization, which allows a model event to be
resolved in more than one way at the same time, we propose a
general framework: we adopt a backing-off lattice to reflect
parallel factorizations and to define the paths along which a
model is resolved into factored models, we use a mixture model to
combine parallel paths in the lattice, and we generalize Katz's
backing-off method to integrate all the mixture models obtained by
traversing the entire lattice. Based on this framework, we
formulate two types of model factorization that are used in
natural language modeling.



1 Introduction 

Factorization of statistical language models is the task of
resolving the most discriminative model into factored models and
combining them into a new model that provides a better estimate
for the most discriminative model event. For instance, a new model
for trigrams can be obtained by combining the factored models: a
unigram model, a bigram model and a trigram model; a model for
PP-attachment (Collins & Brooks, 1995) can be obtained by
considering both more discriminative models like
Pr(1 | is, revenue, from, research)¹ and less discriminative ones
like Pr(1 | is, from, research); a lexicalized parsing model can be
approximated by combining a lexical dependency model and a
syntactic structure model (Klein & Manning, 2002). The former two
examples are usually called backing-off.

Therefore, factorization of language models should answer two
questions: how to factorize, and how to combine. Most previous
work on language modeling (Chen & Goodman, 1998; Goodman, 2001)
focuses on sequential model events (such as n-grams), and thus
need not answer the first question, because a sequential model
event like an n-gram gives a natural factorization order: an
n-gram has exactly one type of (n-1)-gram to back off to. For
nonsequential model events, however, we need to specify both.

In this paper, we formulate a framework for language model
factorization. We adopt a backing-off lattice to reflect parallel
factorization and to define the paths along which a model is
resolved into factored models; we use a mixture model to combine
parallel paths in the lattice; and we generalize Katz's
backing-off method to integrate all the mixture models obtained by
traversing the entire lattice.

Based on this framework, we formulate two types of model
factorization that are used in natural language modeling.

1 This example is taken from (Collins & Brooks, 1995).






The remainder of this paper is organized as follows. We first
introduce the backing-off lattice, then explain the mixture model,
formulate the backing-off formula, describe two types of model
factorization, and finally draw conclusions.



2 Backing-Off Lattice 

The backing-off lattice specifies the ways in which an event can
be "factorized" into sub-events. Each lattice node represents a
set of factored events². Each lattice edge connects a parent node
to a child node and represents a factorization manner that
factorizes an event in the parent event set into the set of
factored events represented by its child node. Different lattice
nodes may share a child.

In most previous work, the backing-off lattice is only a list, in
which no node has more than one outgoing edge (backing-off path).
Our backing-off lattice, however, is a directed acyclic graph
(DAG), which means that a model event represented by a lattice
node may have several factorization manners.

Figure 1 shows a backing-off lattice that illustrates how to
factorize a dependency event in a bilexical context-free grammar
(Eisner & Satta, 1999). Each lattice node is denoted by a solid
oval and represents a set of events, each of which is represented
by a dotted oval (if there is only one element in the set, we omit
the dotted oval). Each lattice edge represents a factorization
manner that resolves an event in the parent node (e.g., the left
dotted oval in node (3)) into a set of factored events (e.g., the
set of events in node (4)).

The backing-off lattice should be tailored to the requirements of
the task it is applied to. If it is used for smoothing, we may
want each slice of the lattice to represent model events of the
same specificity, which is lower than that of the previous slice.
To combine different resources, we may want to factorize a complex
model whose statistics are unavailable into factored models whose
statistics are available.




Figure 1: A backing-off lattice for factorization of a depen- 
dency event in a bilexical context-free grammar. Each lattice 
node is denoted by a solid oval and represents a set of events, 
each of which is represented by a dotted oval. Each lattice 
edge represents a factorization manner that resolves an event 
in the parent node (e.g., the left dotted oval in node (3)) into 
a set of factored events (e.g., the set of events in node (4)). 

3 Mixture Model 

Through the backing-off lattice, a model is factorized recursively
into sub-models. Each node may be subject to more than one
factorization manner. We are therefore concerned with how to
approximate the model represented by a lattice node through the
models represented by all its child nodes; for instance, how to
approximate the distribution of events in node 1 of Figure 1 with
the factored models represented by nodes 2 and 3. We use a mixture
model to interpolate all the factored models, and formulate it as
follows.

Let $\mathcal{E}$ denote a backed-off event (e.g., the dependency
event in the lattice root in Figure 1). To get its factored
events, we introduce a series of factorization functions $\Phi_i$,
$0 \le i < I$, each of which corresponds to a lattice edge from
$\mathcal{E}$ and factorizes $\mathcal{E}$ into a set of sub-model
events $S_i$, that is,

$$\Phi_i(\mathcal{E}) = S_i = \{\, e_{ij} \mid 0 \le j < |S_i| \,\} \tag{1}$$



2 We also regard the most discriminative events in a model 
as factored events. 



where $e_{ij}$ is the $j$-th sub-event in the set of events
obtained by factorizing $\mathcal{E}$ using factorization function
$\Phi_i$.
We can view $\Phi_i$ as a hidden random variable corresponding to
a factorization manner. It has a prior distribution
$\Pr(\Phi_i)$, specifying the confidence of selecting the $i$-th
path to back off along. We then have

$$\Pr(S_i) = \Pr(\mathcal{E} \mid \Phi_i) \tag{2}$$

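The distribution of $\mathcal{E}$ can be derived as follows:

$$\Pr(\mathcal{E}) = \sum_{\Phi_i,\, 0 \le i < I} \Pr(\mathcal{E}, \Phi_i) = \sum_{\Phi_i,\, 0 \le i < I} \Pr(\Phi_i)\,\Pr(\mathcal{E} \mid \Phi_i) \tag{3}$$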

From Formulas 2 and 3 we get

$$\Pr(\mathcal{E}) = \sum_{\Phi_i,\, 0 \le i < I} \Pr(\Phi_i)\,\Pr(S_i) \tag{4}$$

Formula 4 shows that the probabilistic model governing event
$\mathcal{E}$ is approximated by a mixture of its factored models
$\Pr(S_i)$ using normalized coefficients $\Pr(\Phi_i)$. Each
coefficient reflects the confidence of selecting a factorization
manner.

The coefficients $\Pr(\Phi_i)$ should sum to 1. Their values can
be set by hand for simplicity, or trained from held-out data using
the EM algorithm (Dempster et al., 1977) or other numerical
methods such as Powell's method (Press et al., 1986).
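
As a minimal sketch of the EM option, assume that each held-out
event comes with the probabilities that the factored models of each
path assign to it. The function name `em_mixture_weights`, the input
layout, and the toy numbers are illustrative assumptions and not part
of the framework itself.

```python
def em_mixture_weights(path_probs, iterations=50):
    """Estimate mixture weights Pr(Phi_i) on held-out data with EM.

    path_probs[n][i] is the probability that the factored models of
    path i assign to the n-th held-out event (hypothetical layout).
    """
    num_paths = len(path_probs[0])
    weights = [1.0 / num_paths] * num_paths  # uniform starting point
    for _ in range(iterations):
        expected = [0.0] * num_paths
        for probs in path_probs:
            joint = [w * p for w, p in zip(weights, probs)]
            total = sum(joint)
            if total == 0.0:
                continue  # no path explains this event; skip it
            # E-step: posterior responsibility of each path for the event
            for i, value in enumerate(joint):
                expected[i] += value / total
        norm = sum(expected)
        if norm == 0.0:
            break
        # M-step: renormalize expected counts so the weights sum to 1
        weights = [count / norm for count in expected]
    return weights


# Toy held-out data: path 0 explains most events better than path 1.
held_out = [(0.20, 0.05), (0.10, 0.02), (0.30, 0.10), (0.01, 0.04)]
weights = em_mixture_weights(held_out)
print(weights)
```

Because every M-step renormalizes the expected counts, the returned
weights always satisfy the sum-to-one requirement.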

If we assume that the factored events in the value set of a
factorization function are independent of each other, we arrive at

$$\Pr(\mathcal{E}) = \sum_{\Phi_i,\, 0 \le i < I} \Pr(\Phi_i) \prod_{j=0}^{|S_i|-1} \Pr(e_{ij}) \tag{5}$$

Formula 5 shows that if the events in the value set of each
factorization function are assumed to be independent of each
other, the probability of the value set given the factorization
function equals the product of the probabilities of the events in
the set.

We can derive the mixture model for conditional 
distributions similarly. 

Let us give an example to illustrate the above idea. In the
backing-off lattice shown in Figure 1, for the event in node 1 to
be factored into the event sets in nodes 2 and 3, respectively, we
need to introduce the following factorization functions:

$\Phi_0$ = factorize $\mathcal{E}$ into a lexical dependency and a
syntactic dependency.

$\Phi_1$ = factorize $\mathcal{E}$ into sub-events, each of which
describes the dependency between a (parent or dependent) lexical
head and a (dependent or parent) nonterminal label.

These functions project the event into

$$S_0 = \{\text{in} \to \text{June},\ \text{IN} \to \text{NN}\} \tag{6}$$

$$S_1 = \{\text{in} \to \text{NN},\ \text{IN} \to \text{June}\} \tag{7}$$

where each of the two sets contains two factored dependency
events.

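Then, based on Formula 4, we get

$$\Pr(\text{IN}[\text{in}] \to \text{NN}[\text{June}]) = \Pr(\Phi_0)\,\Pr(\text{in} \to \text{June},\ \text{IN} \to \text{NN}) + \Pr(\Phi_1)\,\Pr(\text{in} \to \text{NN},\ \text{IN} \to \text{June}) \tag{8}$$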

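Assuming that factored events in the same set are independent of
each other, from Formula 5 we get

$$\Pr(\text{IN}[\text{in}] \to \text{NN}[\text{June}]) = \Pr(\Phi_0)\,\Pr(\text{in} \to \text{June})\,\Pr(\text{IN} \to \text{NN}) + \Pr(\Phi_1)\,\Pr(\text{in} \to \text{NN})\,\Pr(\text{IN} \to \text{June}) \tag{9}$$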

Now the distribution $\Pr(\text{IN}[\text{in}] \to \text{NN}[\text{June}])$
is approximated by the mixture of the factored models
$\Pr(\text{in} \to \text{June})$, $\Pr(\text{IN} \to \text{NN})$,
$\Pr(\text{in} \to \text{NN})$ and $\Pr(\text{IN} \to \text{June})$,
which are obtained based on a pre-defined backing-off lattice.
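
The arithmetic of this example can be sketched in a few lines of
Python. The helper `mixture_probability`, the path weights, and all
probability values below are hypothetical and chosen only to make the
computation visible; they are not estimates from any corpus.

```python
from math import prod


def mixture_probability(paths):
    """Mix parallel backing-off paths under the independence assumption.

    `paths` is a list of (weight, factor_probs) pairs: `weight` plays the
    role of Pr(Phi_i) and `factor_probs` holds Pr(e_ij) for each factored
    event produced by that path.
    """
    total_weight = sum(weight for weight, _ in paths)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("path weights must sum to 1")
    return sum(weight * prod(factor_probs) for weight, factor_probs in paths)


# Hypothetical factored-model probabilities.
p_in_june, p_IN_NN = 0.02, 0.30   # path Phi_0: lexical + syntactic events
p_in_NN, p_IN_june = 0.10, 0.01   # path Phi_1: mixed head/label events
estimate = mixture_probability([
    (0.6, [p_in_june, p_IN_NN]),
    (0.4, [p_in_NN, p_IN_june]),
])
print(estimate)
```

Each path contributes the product of its factored probabilities,
scaled by the confidence assigned to that path.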

4 Backing-Off Formula

We have presented a mixture model to approximate a more
discriminative model with less discriminative factored models
based on a backing-off lattice and a set of factorization
functions. A mixture model, however, only combines the factored
models obtained by factorizing one lattice node, and if we
traverse the backing-off lattice we get a series of mixtures. We
therefore generalize Katz's backing-off method (Katz, 1987) to
organize these mixtures by firing the corresponding mixture model
when backing-off takes place.

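$$
\Pr_{bo}(\mathcal{E}_1 \mid \mathcal{E}_2) =
\begin{cases}
\dfrac{C(\mathcal{E}_1 \mathcal{E}_2)}{C(\mathcal{E}_2)} & C(\mathcal{E}_1 \mathcal{E}_2) > K \\[6pt]
\beta\, \dfrac{C(\mathcal{E}_1 \mathcal{E}_2)}{C(\mathcal{E}_2)} & 1 \le C(\mathcal{E}_1 \mathcal{E}_2) \le K \\[6pt]
\alpha\, \mathcal{MIXTURE} & \text{otherwise}
\end{cases}
\tag{10}
$$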

where $\mathcal{MIXTURE}$ represents the conditional version of
the mixture model of Section 3, and $K$ is a frequency threshold
for discounting. $\mathcal{E}_1$ and $\mathcal{E}_2$ refer, in
general, to two events that adjacently co-occur in a corpus.

The basic idea of this backing-off method is the same as that of
Katz's. That is, the backing-off formula has a recursive format.
At each step of the recursion there are three branches, each
associated with a firing condition. If the frequency of the
current model event is large enough (greater than $K$; Katz used
$K = 5$), the maximum-likelihood estimator (MLE) is used. If the
frequency is within the range $[1, K]$, the MLE probabilities are
discounted in some manner so that some probability mass is
reserved for unseen events. If the model event never occurs in the
training data, we use the estimates from the factored model
events.

The difference therefore lies in the combination of the estimates
of factored events. In traditional backing-off methods, there is
only one backing-off path to follow when the backing-off condition
is satisfied. For example, an n-gram has exactly one (n-1)-gram to
back off to. In our case, however, there is more than one
backing-off path to go through, and none is a branch of another.
The mixture model obtained in the previous section is therefore
embedded here.
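
The three firing conditions described in this section can be
sketched as a small Python function. This is a minimal sketch under
simplifying assumptions: the discount is passed in as a single factor
`beta`, and the normalizer `alpha` and the mixture estimate are assumed
to be computed elsewhere. All names and numbers are hypothetical.

```python
def backoff_estimate(joint_count, context_count, threshold_k, beta, alpha,
                     mixture_estimate):
    """Pick one of the three branches for Pr_bo(E1 | E2).

    joint_count      -- C(E1 E2), how often the event was seen in its context
    context_count    -- C(E2), how often the context was seen
    mixture_estimate -- mixture over all parallel backing-off paths
    """
    if joint_count > threshold_k:
        # Frequent event: trust the maximum-likelihood estimate.
        return joint_count / context_count
    if joint_count >= 1:
        # Rare but seen event: discounted maximum-likelihood estimate.
        return beta * joint_count / context_count
    # Unseen event: fall back to the mixture of the factored models.
    return alpha * mixture_estimate


# Hypothetical settings: K = 5, discount 0.8, normalizer 0.5.
frequent = backoff_estimate(10, 100, 5, 0.8, 0.5, 0.02)
rare = backoff_estimate(3, 100, 5, 0.8, 0.5, 0.02)
unseen = backoff_estimate(0, 100, 5, 0.8, 0.5, 0.02)
print(frequent, rare, unseen)
```

In a full implementation `mixture_estimate` would itself be obtained
by applying the same function recursively to each child node of the
lattice.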

$\beta$ is for normalization and can be computed according to

$\alpha$ is also for normalization. It is computed from the amount
of probability mass reserved for unseen events. $\alpha$ is a
function of $\mathcal{E}_2$ because $\mathcal{E}_2$ is the given
event of a conditional probability, and each conditional
probability should satisfy the normalization requirement. It is
computed similarly to that in Katz's original paper:

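$$
\alpha = \alpha(\mathcal{E}_2)
= \frac{1 - \sum_{\mathcal{E}_1, \mathcal{E}_2 :\, C(\mathcal{E}_1 \mathcal{E}_2) > 0} \Pr_{bo}(\mathcal{E}_1 \mid \mathcal{E}_2)}
       {\sum_{\mathcal{E}_1, \mathcal{E}_2 :\, C(\mathcal{E}_1 \mathcal{E}_2) = 0} \mathcal{MIXTURE}}
= \frac{1 - \sum_{\mathcal{E}_1, \mathcal{E}_2 :\, C(\mathcal{E}_1 \mathcal{E}_2) > 0} \Pr_{bo}(\mathcal{E}_1 \mid \mathcal{E}_2)}
       {1 - \sum_{\mathcal{E}_1, \mathcal{E}_2 :\, C(\mathcal{E}_1 \mathcal{E}_2) > 0} \mathcal{MIXTURE}}
\tag{12}
$$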
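
The normalizer can be checked numerically with a short Python
sketch. It assumes a fixed conditioning event, a small hypothetical
vocabulary of conditioned events, and a mixture estimate that itself
sums to 1 over that vocabulary; the function name and all numbers are
illustrative.

```python
def backoff_weight(seen_backoff_probs, seen_mixture_probs):
    """alpha = (1 - sum of Pr_bo over seen events)
             / (1 - sum of mixture estimates over seen events)."""
    reserved_mass = 1.0 - sum(seen_backoff_probs.values())
    unseen_mixture_mass = 1.0 - sum(seen_mixture_probs.values())
    if unseen_mixture_mass <= 0.0:
        return 0.0  # the mixture leaves no mass for unseen events
    return reserved_mass / unseen_mixture_mass


# Hypothetical context with vocabulary {a, b, c, d}; only a and b were seen.
seen_backoff = {"a": 0.5, "b": 0.3}                  # discounted estimates
mixture = {"a": 0.4, "b": 0.2, "c": 0.3, "d": 0.1}   # sums to 1
alpha = backoff_weight(seen_backoff,
                       {event: mixture[event] for event in seen_backoff})

# Build the full conditional distribution and check that it normalizes.
full = dict(seen_backoff)
for event, prob in mixture.items():
    if event not in seen_backoff:
        full[event] = alpha * prob
print(alpha, sum(full.values()))
```

Scaling the mixture estimates of the unseen events by the resulting
weight hands them exactly the probability mass that discounting
reserved.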

5 Model Factorizations 

Having presented a framework that allows a model event to be
factorized along more than one path and combines the different
paths in a backing-off formula, we now formulate two types of
model factorization that are used in natural language modeling. We
first introduce some notation.

Let a matrix $\mathcal{M}_{m \times n}$ of random variables
represent a linguistic object that simultaneously expresses two
types of information in its row and column directions. For
example, the matrix³

$$\begin{pmatrix} \text{nonstop} & \text{flights} \\ \text{sin} & \text{vuelos} \end{pmatrix}$$

denotes two lexical dependencies, each of which is the translation
of the other. It expresses dependency relationship information in
the row direction and translation relationship information in the
column direction.

Let $A \in \mathcal{M}_{m \times n}$ and
$B \in \mathcal{M}_{m \times n}$.⁴ We want to determine
$\Pr(A \mid B)$ using its factored models. We can either factorize
both $A$ and $B$ synchronously, or factorize only the conditioning
event $B$, which results in two types of factorization for
different tasks: synchronous factorization and asynchronous
factorization.

3 This example is due to the hierarchical alignment in Figure 10
in (Alshawi et al., 2000).

4 We let $A$ and $B$ have identical distributions for simplicity
of explanation. They need not be identical in general. The matrix
$\mathcal{M}$ also need not have only rows and columns, but might
be of the form $\mathcal{M}_{m \times n \times l}$.
5.1 Synchronous factorization

In synchronous factorization, both $A$ and $B$ are factorized in
the same manner according to some correspondence assumption, and
the factored models determine the marginal information of the
entire model.

Based on Formula 3 (the mixture model) and the assumption that
$A$ and $B$ are in sync with each other on rows (or columns), we
formulate synchronous factorization as follows:



$$\Pr(A \mid B) = \Pr(\Phi_{row})\,\Pr(S^A_{row} \mid S^B_{row}) + \Pr(\Phi_{col})\,\Pr(S^A_{col} \mid S^B_{col}) \tag{13}$$



where

$\Phi_{row}$ projects $A$ and $B$ on rows, respectively, and
therefore results in row vectors:

- $S^A_{row} = \{(A_{ij})_{1 \times n} \mid 1 \le i \le m\}$,
  the set of row vectors of $A$

- $S^B_{row} = \{(B_{ij})_{1 \times n} \mid 1 \le i \le m\}$,
  the set of row vectors of $B$

$\Phi_{col}$ projects $A$ and $B$ on columns, respectively, and
therefore results in column vectors:

- $S^A_{col} = \{(A_{ij})_{m \times 1} \mid 1 \le j \le n\}$,
  the set of column vectors of $A$

- $S^B_{col} = \{(B_{ij})_{m \times 1} \mid 1 \le j \le n\}$,
  the set of column vectors of $B$

The factored models $\Pr(S^A_{row} \mid S^B_{row})$ and
$\Pr(S^A_{col} \mid S^B_{col})$ will be further factorized by other
factorization functions, and the results of these factorization
functions constitute the backing-off lattice. All these factored
models are then combined by the backing-off formula.

Let us give an example. Let

$$A = \begin{pmatrix} \text{nonstop} \\ \text{sin} \end{pmatrix} \tag{14}$$

and let

$$B = \begin{pmatrix} \text{flights} \\ \text{vuelos} \end{pmatrix} \tag{15}$$

Under factorization manner $\Phi_{row}$, and from Formulas 13, 14
and 15, we get

$$\Pr\left( \begin{pmatrix} \text{nonstop} \\ \text{sin} \end{pmatrix} \,\middle|\, \begin{pmatrix} \text{flights} \\ \text{vuelos} \end{pmatrix} \right) = \Pr(\Phi_{row})\,\Pr(S^A_{row} \mid S^B_{row}) \tag{16}$$

where

$$S^A_{row} = \Phi_{row}(A) = \{[\text{nonstop}], [\text{sin}]\} \tag{17}$$

$$S^B_{row} = \Phi_{row}(B) = \{[\text{flights}], [\text{vuelos}]\} \tag{18}$$

and $\Pr(\Phi_{row}) = 1$ (normalized).

If we assume that an element in $S^A_{row}$ depends only on the
corresponding element in $S^B_{row}$, we get

$$\Pr(S^A_{row} \mid S^B_{row}) = \Pr(\text{nonstop} \mid \text{flights})\,\Pr(\text{sin} \mid \text{vuelos}) \tag{19}$$



Formulas 16 and 19 indicate that, by synchronous factorization, a
bilingual lexical dependency model like that in (Alshawi et al.,
2000) is approximated by two factored lexical dependency models,
each of which corresponds to one language.
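
The row-wise case of this example can be sketched in Python as
follows. The function name, the table `row_model`, and its probability
values are hypothetical; the sketch assumes that row $i$ of $A$ depends
only on row $i$ of $B$ and that the row path is the only one used.

```python
from math import prod


def synchronous_row_probability(rows_a, rows_b, row_model):
    """Pr(S_row^A | S_row^B) when row i of A depends only on row i of B."""
    if len(rows_a) != len(rows_b):
        raise ValueError("A and B must have the same number of rows")
    return prod(row_model[(a, b)] for a, b in zip(rows_a, rows_b))


# Hypothetical per-language dependency probabilities Pr(row of A | row of B).
row_model = {
    (("nonstop",), ("flights",)): 0.05,   # English dependency
    (("sin",), ("vuelos",)): 0.20,        # Spanish dependency
}
A = [("nonstop",), ("sin",)]      # rows of matrix A
B = [("flights",), ("vuelos",)]   # rows of matrix B
weight_row = 1.0                  # Pr(Phi_row); the only path in this example
estimate = weight_row * synchronous_row_probability(A, B, row_model)
print(estimate)
```

Adding the column path would contribute a second weighted term of
the same shape, with the two weights summing to 1.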

The factorization of a bilexical context-free grammar into a
lexical dependency model and a syntactic structure model in (Klein
& Manning, 2002) is actually synchronous factorization.

Synchronous factorization is usually used for information
combination, either when we only have the statistics of the
factored models and want to use them to approximate a more complex
model, or when we want to simplify a complex model into factored
models to gain efficiency (e.g., Klein & Manning, 2002).

5.2 Asynchronous factorization 

Another type of factorization that is frequently used in
statistical language modeling is asynchronous factorization, where
only the conditioning event of the conditional probability
$\Pr(A \mid B)$ is recursively factorized while the conditioned
event is kept fixed. The following formula describes the idea.
$$\Pr(A \mid B) = \sum_{i=1}^{|B|} \Pr(\Phi_i)\,\Pr(A \mid \Phi_i(B)) \tag{20}$$

where $\Phi_i$ = "drop the $i$-th element in matrix $B$", and
$|B|$ is the number of elements in matrix $B$. If we further
factorize model $\Pr(A \mid \Phi_i(B))$, matrix $B$ will be a
partial matrix that contains only part of the elements of the
original matrix.

Formula 20 indicates that the original model is recursively
factorized into sub-models, and each recursion step has $|B|$
factorization manners, each of which drops only one element from
$B$.

In the following, we give an example to illustrate the idea of
asynchronous factorization. In PP-attachment, suppose we want to
factorize the model Pr(1 | is, revenue, from, research)⁵, which
determines the probability of attaching the prepositional phrase
"from research" to the noun "revenue" instead of to the verb "is".
Based on Formula 20 we get

$$
\begin{aligned}
\Pr(1 \mid \text{is, revenue, from, research}) ={}& \Pr(\Phi_{revenue})\,\Pr(1 \mid \text{is, from, research}) \\
&+ \Pr(\Phi_{research})\,\Pr(1 \mid \text{is, revenue, from}) \\
&+ \Pr(\Phi_{is})\,\Pr(1 \mid \text{revenue, from, research})
\end{aligned}
\tag{21}
$$

where $\Phi_{word}$ refers to the factorization function that
drops $word$ from the conditioning event of the model on the
left-hand side. The factored models on the right-hand side can be
further factorized by continuing to traverse the backing-off
lattice.

In contrast to synchronous factorization, which is used for
information combination, asynchronous factorization is usually
used for smoothing.

In practice, there might be a compromise between the two, where we
factorize both the conditioned and the conditioning events, but
not in a synchronous manner. Matrices $A$ and $B$ also need not
have the same number of rows and columns.

5 Once again, this example is taken from (Collins & Brooks, 1995).

6 Related Works

(Collins & Brooks, 1995) put forward a method for backing off
models of prepositional phrase attachment, which is in fact
asynchronous factorization. It mixes the frequencies of the
different backing-off choices at a given recursion step by
dividing the sum of the frequencies of all the more discriminative
model events by the sum of those of the less discriminative ones,
provided the latter sum is greater than zero; otherwise,
backing-off continues. One characteristic of this method is that
if one of the less discriminative model events has non-zero
frequency, the backing-off terminates, no matter whether other
events at the same backing-off level have zero counts or not. The
mixture model we introduced to combine parallel backing-off paths,
in contrast, lets those zero-count events back off further so that
they can still contribute to the final result. We think this is
necessary when the backing-off framework is used for information
combination.
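
The behaviour of the mixture on zero-count events can be sketched
in Python. This is a simplified illustration, not the full framework:
it uses uniform path weights, no discounting, one path per dropped
context element, and a hypothetical count table; the function names
and the `floor` default are assumptions.

```python
def context_count(counts, context):
    """Total count of a conditioning context over all outcomes."""
    return sum(n for (_, ctx), n in counts.items() if ctx == context)


def mixture_backoff(outcome, context, counts, floor=0.5):
    """Estimate Pr(outcome | context), backing off along every drop-one path."""
    total = context_count(counts, context)
    if total > 0:
        return counts.get((outcome, context), 0) / total
    if not context:
        return floor  # nothing left to drop; hypothetical default
    # One parallel path per dropped element, mixed with uniform weights.
    sub_contexts = [context[:i] + context[i + 1:] for i in range(len(context))]
    return sum(mixture_backoff(outcome, sub, counts, floor)
               for sub in sub_contexts) / len(sub_contexts)


V, N1, P = ("v", "is"), ("n1", "revenue"), ("p", "from")
counts = {  # hypothetical (attachment decision, context) counts
    (1, (N1, P)): 3, (0, (N1, P)): 1,
    (1, (N1,)): 3, (0, (N1,)): 3,
    (1, (P,)): 6, (0, (P,)): 4,
    (1, ()): 12, (0, ()): 8,
}
estimate = mixture_backoff(1, (V, N1, P), counts)
print(estimate)
```

Here only the sub-context (revenue, from) has been observed at the
first level, yet the two zero-count sub-contexts still contribute by
backing off further to their own sub-contexts.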

It turned out that (Bilmes & Kirchhoff, 2003) were independently
working on similar ideas. They introduced factored models and also
generalized the backing-off framework to handle parallel
backing-off paths. The differences between their work and ours
are: (1) their factored models can be categorized as the
asynchronous factorization type, where only the conditioning
matrix (the feature vector in their paper) is factorized; (2) we
also formulated the synchronous factorization type, where both the
conditioned and conditioning matrices are factorized
synchronously, and we showed that this is useful for combining
different information sources; (3) we use a mixture model to
combine parallel paths, while they selected the path with the
maximum value, and we think that combining the contribution of
each backing-off path is useful when we want to combine different
information resources; (4) in our framework, the result of each
factorization function (backing-off path) is a set of events (see
Formula 1), not merely one event, which is useful for the kind of
factorization done in (Klein & Manning, 2002) using Formula 5.
7 Conclusions

We have presented a framework for language model factorization. We
adopt a backing-off lattice to reflect parallel factorizations and
to define the paths along which a model is resolved into factored
models, we use a mixture model to combine parallel paths in the
lattice, and we generalize Katz's backing-off method to integrate
all the mixture models obtained by traversing the entire lattice.

Based on this framework, we formulate two types of model
factorization that are used in natural language modeling.

References

W. Press, B. Flannery, S. Teukolsky, and W. Vetterling (1986).
"Numerical Recipes, The Art of Scientific Computing," Cambridge
University Press.